Removing Duplicate URLs based on URL Normalization and Query Parameter
Similar resources
Query Based Duplicate Data Detection on WWW
The problem of finding relevant documents has become much more prominent due to the presence of duplicate data on the WWW. This redundancy in results increases the users’ seek time to find the desired information within the search results, while in general most users just want to cull through tens of result pages to find new/different results. The identification of similar or near-duplicate pai...
Efficient Summarization of URLs using CRC32 for Implementing URL Switching
We investigate methods of using CRC32 for compressing Web URL strings and sharing of URL lists between servers, caches, and URL switches. Using trace-based evaluation, we compare our new CRC32 digesting method against existing Bloom filter and incremental CRC19 methods. Our CRC32 method requires less CPU resources, generates equal or smaller size digests, achieves equal collision rates, and sim...
Query-URL Bipartite Based Approach to Personalized Query Recommendation
Query recommendation is considered an effective assistant in enhancing keyword based queries in search engines and Web search software. The conventional approach to query recommendation has focused on query-term based analysis over the user access logs. In this paper, we argue that utilizing the connectivity of a query-URL bipartite graph to recommend relevant queries can significantly improve...
Reliable Evaluations of URL Normalization
URL normalization is a process of transforming URL strings into canonical form. Through this process, duplicate URL representations for web pages can be reduced significantly. There are a number of normalization methods. In this paper, we describe four metrics for evaluating normalization methods. The reliability and consistency of a URL is also considered in our evaluation. With the metrics pr...
What’s in a URL? Genre Classification from URLs
The importance of URLs in the representation of a document cannot be overstated. Shorthand mnemonics such as “wiki” or “blog” are often embedded in a URL to convey its functional purpose or genre. Other mnemonics have evolved from use (e.g., a Wordpress particle is strongly suggestive of blogs). Can we leverage from this predictive power to induce the genre of a document from the representation...
Journal
Journal title: International Journal of Engineering & Technology
Year: 2018
ISSN: 2227-524X
DOI: 10.14419/ijet.v7i3.12.16107